DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers are needed to manually screen each submission before it is approved for posting on the DonorsChoose.org website.
Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve: how to scale manual screening to that volume, how to keep vetting consistent across volunteers, and how to focus volunteer time on the proposals that most need review.
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
The train.csv data set provided by DonorsChoose contains the following features:
| Feature | Description |
|---|---|
| project_id | A unique identifier for the proposed project. Example: p036502 |
| project_title | Title of the project. |
| project_grade_category | Grade level of students the project is targeted at. One of: Grades PreK-2, Grades 3-5, Grades 6-8, Grades 9-12 |
| project_subject_categories | One or more (comma-separated) subject categories for the project. |
| school_state | State where the school is located (two-letter U.S. postal code). Example: WY |
| project_subject_subcategories | One or more (comma-separated) subject subcategories for the project. |
| project_resource_summary | An explanation of the resources needed for the project. |
| project_essay_1 | First application essay* |
| project_essay_2 | Second application essay* |
| project_essay_3 | Third application essay* |
| project_essay_4 | Fourth application essay* |
| project_submitted_datetime | Datetime when the project application was submitted. Example: 2016-04-28 12:43:56.245 |
| teacher_id | A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56 |
| teacher_prefix | Teacher's title. One of: Dr., Mr., Mrs., Ms., Teacher |
| teacher_number_of_previously_posted_projects | Number of project applications previously submitted by the same teacher. Example: 2 |
* See the section Notes on the Essay Data for more details about these features.
Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:
| Feature | Description |
|---|---|
| id | A project_id value from the train.csv file. Example: p036502 |
| description | Description of the resource. Example: Tenor Saxophone Reeds, Box of 25 |
| quantity | Quantity of the resource required. Example: 3 |
| price | Price of the resource required. Example: 9.95 |
Note: many projects require multiple resources. The id value corresponds to a project_id in train.csv, so you can use it as a key to retrieve all resources needed for a project:
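For example, summing price and quantity per project (the same aggregation used later in this notebook) can be done with a groupby on the id column; a minimal sketch with toy stand-in data:

```python
import pandas as pd

# toy stand-in for resources.csv: two resources for one project, one for another
resources = pd.DataFrame({
    'id': ['p036502', 'p036502', 'p069063'],
    'quantity': [3, 1, 3],
    'price': [9.95, 149.00, 14.95],
})

# total price and quantity per project; 'id' is the key back into train.csv
per_project = resources.groupby('id').agg({'price': 'sum', 'quantity': 'sum'}).reset_index()
print(per_project)
```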
The data set contains the following label (the value you will attempt to predict):
| Label | Description |
|---|---|
| project_is_approved | A binary flag indicating whether DonorsChoose approved the project: 0 = not approved, 1 = approved. |
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
from tqdm import tqdm
import os
from chart_studio import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter
project_data = pd.read_csv('train_data.csv')
resource_data = pd.read_csv('resources.csv')
print("Number of data points in train data", project_data.shape)
print('-'*100)
print("The features of data :", project_data.columns.values)
print('-'*100)
project_data.head(2)
Number of data points in train data (109248, 17) ---------------------------------------------------------------------------------------------------- The features of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state' 'project_submitted_datetime' 'project_grade_category' 'project_subject_categories' 'project_subject_subcategories' 'project_title' 'project_essay_1' 'project_essay_2' 'project_essay_3' 'project_essay_4' 'project_resource_summary' 'teacher_number_of_previously_posted_projects' 'project_is_approved'] ----------------------------------------------------------------------------------------------------
| Unnamed: 0 | id | teacher_id | teacher_prefix | school_state | project_submitted_datetime | project_grade_category | project_subject_categories | project_subject_subcategories | project_title | project_essay_1 | project_essay_2 | project_essay_3 | project_essay_4 | project_resource_summary | teacher_number_of_previously_posted_projects | project_is_approved | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 160221 | p253737 | c90749f5d961ff158d4b4d1e7dc665fc | Mrs. | IN | 2016-12-05 13:43:57 | Grades PreK-2 | Literacy & Language | ESL, Literacy | Educational Support for English Learners at Home | My students are English learners that are work... | \"The limits of your language are the limits o... | NaN | NaN | My students need opportunities to practice beg... | 0 | 0 |
| 1 | 140945 | p258326 | 897464ce9ddc600bced1151f324dd63a | Mr. | FL | 2016-10-25 09:22:10 | Grades 6-8 | History & Civics, Health & Sports | Civics & Government, Team Sports | Wanted: Projector for Hungry Learners | Our students arrive to our school eager to lea... | The projector we need for our school is very c... | NaN | NaN | My students need a projector to help with view... | 7 | 1 |
print("Number of data points in resource data", resource_data.shape)
print('-'*100)
print("The features of resource data :", resource_data.columns.values)
print('-'*100)
resource_data.head(2)
Number of data points in resource data (1541272, 4) ---------------------------------------------------------------------------------------------------- The features of resource data : ['id' 'description' 'quantity' 'price'] ----------------------------------------------------------------------------------------------------
| id | description | quantity | price | |
|---|---|---|---|---|
| 0 | p233245 | LC652 - Lakeshore Double-Space Mobile Drying Rack | 1 | 149.00 |
| 1 | p069063 | Bouncy Bands for Desks (Blue support pipes) | 3 | 14.95 |
# how to replace elements in list python: https://stackoverflow.com/a/2582163/4084039
cols = ['Date' if x=='project_submitted_datetime' else x for x in list(project_data.columns)]
#sort dataframe based on time pandas python: https://stackoverflow.com/a/49702492/4084039
project_data['Date'] = pd.to_datetime(project_data['project_submitted_datetime'])
project_data.drop('project_submitted_datetime', axis=1, inplace=True)
project_data.sort_values(by=['Date'], inplace=True)
# how to reorder columns pandas python: https://stackoverflow.com/a/13148611/4084039
project_data = project_data[cols]
project_data.head(2)
| Unnamed: 0 | id | teacher_id | teacher_prefix | school_state | Date | project_grade_category | project_subject_categories | project_subject_subcategories | project_title | project_essay_1 | project_essay_2 | project_essay_3 | project_essay_4 | project_resource_summary | teacher_number_of_previously_posted_projects | project_is_approved | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55660 | 8393 | p205479 | 2bf07ba08945e5d8b2a3f269b2b3cfe5 | Mrs. | CA | 2016-04-27 00:27:36 | Grades PreK-2 | Math & Science | Applied Sciences, Health & Life Science | Engineering STEAM into the Primary Classroom | I have been fortunate enough to use the Fairy ... | My students come from a variety of backgrounds... | Each month I try to do several science or STEM... | It is challenging to develop high quality scie... | My students need STEM kits to learn critical s... | 53 | 1 |
| 76127 | 37728 | p043609 | 3f60494c61921b3b43ab61bdde2904df | Ms. | UT | 2016-04-27 00:31:25 | Grades 3-5 | Special Needs | Special Needs | Sensory Tools for Focus | Imagine being 8-9 years old. You're in your th... | Most of my students have autism, anxiety, anot... | It is tough to do more than one thing at a tim... | When my students are able to calm themselves d... | My students need Boogie Boards for quiet senso... | 4 | 1 |
y_value_counts = project_data['project_is_approved'].value_counts()
print(y_value_counts)
1 92706 0 16542 Name: project_is_approved, dtype: int64
project_data.isnull().sum()
Unnamed: 0 0 id 0 teacher_id 0 teacher_prefix 3 school_state 0 Date 0 project_grade_category 0 project_subject_categories 0 project_subject_subcategories 0 project_title 0 project_essay_1 0 project_essay_2 0 project_essay_3 105490 project_essay_4 105490 project_resource_summary 0 teacher_number_of_previously_posted_projects 0 project_is_approved 0 dtype: int64
project_data['teacher_prefix'] = project_data['teacher_prefix'].fillna('Mrs.')  # impute the 3 missing values with the most frequent prefix
# merge the four essay columns into a single text column;
# fillna('') avoids concatenating the literal string 'nan' for missing essays
project_data["essay"] = project_data["project_essay_1"].fillna('') + ' ' + \
                        project_data["project_essay_2"].fillna('') + ' ' + \
                        project_data["project_essay_3"].fillna('') + ' ' + \
                        project_data["project_essay_4"].fillna('')
# https://stackoverflow.com/a/47091490/4084039
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
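The substitution chain can be sanity-checked on a sample sentence (an illustrative snippet, not part of the original notebook):

```python
import re

# the same decontraction rules as above, applied in the same order
rules = [(r"won't", "will not"), (r"can\'t", "can not"),
         (r"n\'t", " not"), (r"\'re", " are"), (r"\'s", " is"),
         (r"\'d", " would"), (r"\'ll", " will"), (r"\'t", " not"),
         (r"\'ve", " have"), (r"\'m", " am")]

sample = "We won't stop; they shouldn't ignore what we've built."
for pattern, replacement in rules:
    sample = re.sub(pattern, replacement, sample)
print(sample)  # We will not stop; they should not ignore what we have built.
```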
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"]
# Combining all the above steps into a single cleaning function
def Text_cleaner(data):
    preprocessed_essays = []
    # tqdm prints a progress bar over the rows
    for sentence in tqdm(data.values):
        sent = decontracted(sentence)
        sent = sent.replace('\\r', ' ')
        sent = sent.replace('\\"', ' ')
        sent = sent.replace('\\n', ' ')
        sent = sent.replace('nan', ' ')        # drop the literal 'nan' left by missing essays
        sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
        # https://gist.github.com/sebleier/554280
        sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
        preprocessed_essays.append(sent.lower().strip())
    return preprocessed_essays
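To see what the cleaner does to a single essay, here is a self-contained miniature of the same steps (decontraction, special-character removal, stop-word filtering) applied to one string; the tiny STOP set stands in for the full stopword list above:

```python
import re

STOP = {"my", "are", "to", "the", "a"}  # illustrative subset of the stopword list

def clean_one(sent):
    sent = re.sub(r"can\'t", "can not", sent)   # a couple of the decontraction rules
    sent = re.sub(r"n\'t", " not", sent)
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)   # keep only alphanumerics
    sent = ' '.join(w for w in sent.split() if w.lower() not in STOP)
    return sent.lower().strip()

print(clean_one("My students can't wait to use the new iPads!"))
# students can not wait use new ipads
```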
# after preprocessing
preprocessed_essays = Text_cleaner(project_data['essay'])
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109248/109248 [01:18<00:00, 1391.00it/s]
project_data['essay'] = preprocessed_essays
project_data.drop(['project_essay_1'], axis=1, inplace=True)
project_data.drop(['project_essay_2'], axis=1, inplace=True)
project_data.drop(['project_essay_3'], axis=1, inplace=True)
project_data.drop(['project_essay_4'], axis=1, inplace=True)
project_title
# after preprocessing
preprocessed_title = Text_cleaner(project_data['project_title'])
project_data['cleaned_title'] = preprocessed_title
project_data.drop(['project_title'], axis=1, inplace=True)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109248/109248 [00:03<00:00, 35128.45it/s]
project_resource_summary
# similarly, preprocess the resource summary
preprocessed_project_summary = Text_cleaner(project_data['project_resource_summary'])
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109248/109248 [00:07<00:00, 14996.63it/s]
project_data['cleaned_summary']=preprocessed_project_summary
project_data.drop(['project_resource_summary'], axis=1, inplace=True)
project_subject_categories
categories = list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in categories:
    temp = ""
    # e.g. "Math & Science, Warmth, Care & Hunger" splits into ["Math & Science", " Warmth", " Care & Hunger"]
    for j in i.split(','):
        if 'The' in j.split():       # drop the word "The" from category names
            j = j.replace('The', '')
        j = j.replace(' ', '')       # remove all spaces: "Math & Science" => "Math&Science"
        temp += j.strip() + " "
    temp = temp.replace('&', '_')    # replace '&' with '_': "Math&Science" => "Math_Science"
    cat_list.append(temp.strip())
project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)
from collections import Counter
my_counter = Counter()
for word in project_data['clean_categories'].values:
my_counter.update(word.split())
cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))
project_subject_subcategories
sub_categories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
sub_cat_list = []
for i in sub_categories:
    temp = ""
    # same cleaning as for the main categories above
    for j in i.split(','):
        if 'The' in j.split():       # drop the word "The" from category names
            j = j.replace('The', '')
        j = j.replace(' ', '')       # remove all spaces: "Math & Science" => "Math&Science"
        temp += j.strip() + " "
    temp = temp.replace('&', '_')    # replace '&' with '_': "Math&Science" => "Math_Science"
    sub_cat_list.append(temp.strip())
project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
my_counter.update(word.split())
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))
project_grade_category
## remove the spaces, replace '-' with '_', and lowercase all letters
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace(' ','_')
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace('-','_')
project_data['project_grade_category'] = project_data['project_grade_category'].str.lower()
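The chained str.replace calls map the raw labels onto snake_case; a quick check on the four known grade values:

```python
import pandas as pd

s = pd.Series(["Grades PreK-2", "Grades 3-5", "Grades 6-8", "Grades 9-12"])
s = s.str.replace(' ', '_').str.replace('-', '_').str.lower()
print(list(s))  # ['grades_prek_2', 'grades_3_5', 'grades_6_8', 'grades_9_12']
```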
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
[nltk_data] Downloading package vader_lexicon to [nltk_data] C:\Users\sonaw\AppData\Roaming\nltk_data... [nltk_data] Package vader_lexicon is already up-to-date!
## https://www.nltk.org/howto/sentiment.html
## polarity_scores returns 4 scores (e.g. neg: 0.0, neu: 0.753, pos: 0.247, compound: 0.93);
## each is added as a separate feature
sid = SentimentIntensityAnalyzer()
negative = []
positive = []
neutral = []
compound = []
for sent in tqdm(project_data['essay']):
    scores = sid.polarity_scores(sent)   # compute once per essay instead of four times
    negative.append(scores['neg'])
    positive.append(scores['pos'])
    neutral.append(scores['neu'])
    compound.append(scores['compound'])
project_data['essay_sentiment_negative'] = negative
project_data['essay_sentiment_positive'] = positive
project_data['essay_sentiment_neutral'] = neutral
project_data['essay_sentiment_compound'] = compound
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109248/109248 [14:20<00:00, 127.03it/s]
## Creating new feature: title_word_count
title_word_count = []
for i in tqdm(project_data['cleaned_title']):
    title_word_count.append(len(i.split()))
project_data['title_word_count'] = title_word_count
project_data.head(2)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109248/109248 [00:00<00:00, 970439.74it/s]
| Unnamed: 0 | id | teacher_id | teacher_prefix | school_state | Date | project_grade_category | teacher_number_of_previously_posted_projects | project_is_approved | essay | cleaned_title | cleaned_summary | clean_categories | clean_subcategories | essay_sentiment_negative | essay_sentiment_positive | essay_sentiment_neutral | essay_sentiment_compound | title_word_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55660 | 8393 | p205479 | 2bf07ba08945e5d8b2a3f269b2b3cfe5 | Mrs. | CA | 2016-04-27 00:27:36 | grades_prek_2 | 53 | 1 | fortunate enough use fairy tale stem kits clas... | engineering steam primary classroom | students need stem kits learn critical science... | Math_Science | AppliedSciences Health_LifeScience | 0.013 | 0.214 | 0.773 | 0.9867 | 4 |
| 76127 | 37728 | p043609 | 3f60494c61921b3b43ab61bdde2904df | Ms. | UT | 2016-04-27 00:31:25 | grades_3_5 | 4 | 1 | imagine 8 9 years old third grade classroom se... | sensory tools focus | students need boogie boards quiet sensory brea... | SpecialNeeds | SpecialNeeds | 0.078 | 0.272 | 0.650 | 0.9899 | 3 |
price_data = resource_data.groupby('id').agg({'price':'sum','quantity':'sum'}).reset_index()
project_data = pd.merge(project_data, price_data, on = 'id',how = 'left')
project_data.head(2)
| Unnamed: 0 | id | teacher_id | teacher_prefix | school_state | Date | project_grade_category | teacher_number_of_previously_posted_projects | project_is_approved | essay | ... | cleaned_summary | clean_categories | clean_subcategories | essay_sentiment_negative | essay_sentiment_positive | essay_sentiment_neutral | essay_sentiment_compound | title_word_count | price | quantity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8393 | p205479 | 2bf07ba08945e5d8b2a3f269b2b3cfe5 | Mrs. | CA | 2016-04-27 00:27:36 | grades_prek_2 | 53 | 1 | fortunate enough use fairy tale stem kits clas... | ... | students need stem kits learn critical science... | Math_Science | AppliedSciences Health_LifeScience | 0.013 | 0.214 | 0.773 | 0.9867 | 4 | 725.05 | 4 |
| 1 | 37728 | p043609 | 3f60494c61921b3b43ab61bdde2904df | Ms. | UT | 2016-04-27 00:31:25 | grades_3_5 | 4 | 1 | imagine 8 9 years old third grade classroom se... | ... | students need boogie boards quiet sensory brea... | SpecialNeeds | SpecialNeeds | 0.078 | 0.272 | 0.650 | 0.9899 | 3 | 213.03 | 8 |
2 rows × 21 columns
y = project_data['project_is_approved']
x = project_data.drop(['project_is_approved'],axis=1)
x.head(1)
| Unnamed: 0 | id | teacher_id | teacher_prefix | school_state | Date | project_grade_category | teacher_number_of_previously_posted_projects | essay | cleaned_title | cleaned_summary | clean_categories | clean_subcategories | essay_sentiment_negative | essay_sentiment_positive | essay_sentiment_neutral | essay_sentiment_compound | title_word_count | price | quantity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8393 | p205479 | 2bf07ba08945e5d8b2a3f269b2b3cfe5 | Mrs. | CA | 2016-04-27 00:27:36 | grades_prek_2 | 53 | fortunate enough use fairy tale stem kits clas... | engineering steam primary classroom | students need stem kits learn critical science... | Math_Science | AppliedSciences Health_LifeScience | 0.013 | 0.214 | 0.773 | 0.9867 | 4 | 725.05 | 4 |
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,stratify=y,random_state=42)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(76473, 20) (76473,) (32775, 20) (32775,)
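Since roughly 85% of projects are approved, stratify=y matters: it keeps the approval ratio identical in both splits. A toy illustration (the names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([1] * 90 + [0] * 10)   # 90% positive, mimicking this dataset's skew
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)
print(y_tr.mean(), y_te.mean())  # 0.9 0.9 -- the class ratio is preserved in both halves
```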
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=10,ngram_range=(1,4),max_features=5000)
vectorizer.fit(x_train['essay'])
# we use the fitted CountVectorizer to convert the text to vector
x_train_essay_bow = vectorizer.transform(x_train['essay'].values)
x_test_essay_bow = vectorizer.transform(x_test['essay'].values)
#print(x_train_essay_bow)
print("After vectorizations")
print(x_train_essay_bow.shape, y_train.shape)
print(x_test_essay_bow.shape, y_test.shape)
feature_names_bow = []
#print(vectorizer.get_feature_names())
feature_names_bow.extend(vectorizer.get_feature_names())
After vectorizations (76473, 5000) (76473,) (32775, 5000) (32775,)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=10,max_features=500)
vectorizer.fit(x_train['cleaned_title'])
# we use the fitted CountVectorizer to convert the text to vector
x_train_title_bow = vectorizer.transform(x_train['cleaned_title'].values)
x_test_title_bow = vectorizer.transform(x_test['cleaned_title'].values)
#print(x_train_essay_bow)
print("After vectorizations")
print(x_train_title_bow.shape, y_train.shape)
print(x_test_title_bow.shape, y_test.shape)
#print(vectorizer.get_feature_names())
feature_names_bow.extend(vectorizer.get_feature_names())
After vectorizations (76473, 500) (76473,) (32775, 500) (32775,)
vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,4),max_features=5000)
vectorizer.fit(x_train['essay'])
# we use the fitted TfidfVectorizer to convert the text to vector
x_train_essay_tfidf = vectorizer.transform(x_train['essay'].values)
x_test_essay_tfidf = vectorizer.transform(x_test['essay'].values)
#print(x_train_essay_tfidf)
print("After vectorizations")
print(x_train_essay_tfidf.shape, y_train.shape)
print(x_test_essay_tfidf.shape, y_test.shape)
feature_names_tfidf = []
#print(vectorizer.get_feature_names())
feature_names_tfidf.extend(vectorizer.get_feature_names())
After vectorizations (76473, 5000) (76473,) (32775, 5000) (32775,)
vectorizer = TfidfVectorizer(min_df=10,max_features=500)
vectorizer.fit(x_train['cleaned_title'])
# we use the fitted TfidfVectorizer to convert the text to vector
x_train_title_tfidf = vectorizer.transform(x_train['cleaned_title'].values)
x_test_title_tfidf = vectorizer.transform(x_test['cleaned_title'].values)
#print(x_train_essay_bow)
print("After vectorizations")
print(x_train_title_tfidf.shape, y_train.shape)
print(x_test_title_tfidf.shape, y_test.shape)
#print(vectorizer.get_feature_names())
feature_names_tfidf.extend(vectorizer.get_feature_names())
After vectorizations (76473, 500) (76473,) (32775, 500) (32775,)
## https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
from sklearn.preprocessing import OneHotEncoder
vectorizer = OneHotEncoder(handle_unknown='ignore') ## One hot encoding
vectorizer.fit(x_train['clean_categories'].values.reshape(-1,1))
x_train_categories = vectorizer.transform(x_train['clean_categories'].values.reshape(-1,1))
x_test_categories = vectorizer.transform(x_test['clean_categories'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_categories.shape, y_train.shape)
print(x_test_categories.shape, y_test.shape)
#print(vectorizer.get_feature_names())
feature_names_bow.extend(vectorizer.get_feature_names())
feature_names_tfidf.extend(vectorizer.get_feature_names())
After vectorizations (76473, 51) (76473,) (32775, 51) (32775,)
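handle_unknown='ignore' means a category seen only in the test split is encoded as an all-zero row instead of raising an error; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array(['CA', 'NY', 'TX']).reshape(-1, 1))   # learned categories: CA, NY, TX

out = enc.transform(np.array(['NY', 'WY']).reshape(-1, 1)).toarray()
print(out)
# [[0. 1. 0.]
#  [0. 0. 0.]]   <- 'WY' was never seen at fit time, so it becomes all zeros
```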
## https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
vectorizer.fit(x_train['clean_subcategories'].values.reshape(-1,1))
x_train_subcategories = vectorizer.transform(x_train['clean_subcategories'].values.reshape(-1,1))
x_test_subcategories = vectorizer.transform(x_test['clean_subcategories'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_subcategories.shape, y_train.shape)
print(x_test_subcategories.shape, y_test.shape)
#print(vectorizer.get_feature_names())
feature_names_bow.extend(vectorizer.get_feature_names())
feature_names_tfidf.extend(vectorizer.get_feature_names())
After vectorizations (76473, 392) (76473,) (32775, 392) (32775,)
## https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
vectorizer.fit(x_train['school_state'].values.reshape(-1,1))
x_train_school_state = vectorizer.transform(x_train['school_state'].values.reshape(-1,1))
x_test_school_state = vectorizer.transform(x_test['school_state'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_school_state.shape, y_train.shape)
print(x_test_school_state.shape, y_test.shape)
print(vectorizer.get_feature_names())
feature_names_tfidf.extend(vectorizer.get_feature_names())
feature_names_bow.extend(vectorizer.get_feature_names())
After vectorizations (76473, 51) (76473,) (32775, 51) (32775,) ['x0_AK' 'x0_AL' 'x0_AR' 'x0_AZ' 'x0_CA' 'x0_CO' 'x0_CT' 'x0_DC' 'x0_DE' 'x0_FL' 'x0_GA' 'x0_HI' 'x0_IA' 'x0_ID' 'x0_IL' 'x0_IN' 'x0_KS' 'x0_KY' 'x0_LA' 'x0_MA' 'x0_MD' 'x0_ME' 'x0_MI' 'x0_MN' 'x0_MO' 'x0_MS' 'x0_MT' 'x0_NC' 'x0_ND' 'x0_NE' 'x0_NH' 'x0_NJ' 'x0_NM' 'x0_NV' 'x0_NY' 'x0_OH' 'x0_OK' 'x0_OR' 'x0_PA' 'x0_RI' 'x0_SC' 'x0_SD' 'x0_TN' 'x0_TX' 'x0_UT' 'x0_VA' 'x0_VT' 'x0_WA' 'x0_WI' 'x0_WV' 'x0_WY']
## https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
vectorizer.fit(x_train['teacher_prefix'].values.reshape(-1,1))
x_train_teacher_prefix = vectorizer.transform(x_train['teacher_prefix'].values.reshape(-1,1))
x_test_teacher_prefix = vectorizer.transform(x_test['teacher_prefix'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_teacher_prefix.shape, y_train.shape)
print(x_test_teacher_prefix.shape, y_test.shape)
print(vectorizer.get_feature_names())
feature_names_tfidf.extend(vectorizer.get_feature_names())
feature_names_bow.extend(vectorizer.get_feature_names())
After vectorizations (76473, 5) (76473,) (32775, 5) (32775,) ['x0_Dr.' 'x0_Mr.' 'x0_Mrs.' 'x0_Ms.' 'x0_Teacher']
## https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
vectorizer.fit(x_train['project_grade_category'].values.reshape(-1,1))
x_train_project_grade_category = vectorizer.transform(x_train['project_grade_category'].values.reshape(-1,1))
x_test_project_grade_category = vectorizer.transform(x_test['project_grade_category'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_project_grade_category.shape, y_train.shape)
print(x_test_project_grade_category.shape, y_test.shape)
print(vectorizer.get_feature_names())
feature_names_tfidf.extend(vectorizer.get_feature_names())
feature_names_bow.extend(vectorizer.get_feature_names())
After vectorizations (76473, 4) (76473,) (32775, 4) (32775,) ['x0_grades_3_5' 'x0_grades_6_8' 'x0_grades_9_12' 'x0_grades_prek_2']
from sklearn.preprocessing import MinMaxScaler
normalizer = MinMaxScaler()
normalizer.fit(x_train['price'].values.reshape(-1,1))
x_train_price_norm = normalizer.transform(x_train['price'].values.reshape(-1,1))
x_test_price_norm = normalizer.transform(x_test['price'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_price_norm.shape, y_train.shape)
print(x_test_price_norm.shape, y_test.shape)
print("="*100)
After vectorizations (76473, 1) (76473,) (32775, 1) (32775,) ====================================================================================================
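The scaler is fit on the training split only, so test values can legitimately fall outside [0, 1]; a small sketch of that behavior:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(np.array([[10.0], [20.0], [30.0]]))   # train min=10, max=30

out = scaler.transform(np.array([[20.0], [40.0]]))
print(out)
# [[0.5]
#  [1.5]]   <- 40 exceeds the training max, so it maps above 1.0
```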
normalizer.fit(x_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
x_train_no_of_projects_norm = normalizer.transform(x_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
x_test_no_of_projects_norm = normalizer.transform(x_test['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_no_of_projects_norm.shape, y_train.shape)
print(x_test_no_of_projects_norm.shape, y_test.shape)
After vectorizations (76473, 1) (76473,) (32775, 1) (32775,)
normalizer.fit(x_train['essay_sentiment_positive'].values.reshape(-1,1))
x_train_essay_sentiment_positive_norm = normalizer.transform(x_train['essay_sentiment_positive'].values.reshape(-1,1))
x_test_essay_sentiment_positive_norm = normalizer.transform(x_test['essay_sentiment_positive'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_essay_sentiment_positive_norm.shape, y_train.shape)
print(x_test_essay_sentiment_positive_norm.shape, y_test.shape)
After vectorizations (76473, 1) (76473,) (32775, 1) (32775,)
normalizer.fit(x_train['essay_sentiment_neutral'].values.reshape(-1,1))
x_train_essay_sentiment_neutral_norm = normalizer.transform(x_train['essay_sentiment_neutral'].values.reshape(-1,1))
x_test_essay_sentiment_neutral_norm = normalizer.transform(x_test['essay_sentiment_neutral'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_essay_sentiment_neutral_norm.shape, y_train.shape)
print(x_test_essay_sentiment_neutral_norm.shape, y_test.shape)
After vectorizations (76473, 1) (76473,) (32775, 1) (32775,)
normalizer.fit(x_train['essay_sentiment_compound'].values.reshape(-1,1))
x_train_essay_sentiment_compound_norm = normalizer.transform(x_train['essay_sentiment_compound'].values.reshape(-1,1))
x_test_essay_sentiment_compound_norm = normalizer.transform(x_test['essay_sentiment_compound'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_essay_sentiment_compound_norm.shape, y_train.shape)
print(x_test_essay_sentiment_compound_norm.shape, y_test.shape)
After vectorizations
(76473, 1) (76473,)
(32775, 1) (32775,)
normalizer.fit(x_train['essay_sentiment_negative'].values.reshape(-1,1))
x_train_essay_sentiment_negative_norm = normalizer.transform(x_train['essay_sentiment_negative'].values.reshape(-1,1))
x_test_essay_sentiment_negative_norm = normalizer.transform(x_test['essay_sentiment_negative'].values.reshape(-1,1))
print("After vectorizations")
print(x_train_essay_sentiment_negative_norm.shape, y_train.shape)
print(x_test_essay_sentiment_negative_norm.shape, y_test.shape)
After vectorizations (76473, 1) (76473,) (32775, 1) (32775,)
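The same fit-on-train / transform-both pattern repeats for every numeric column above; it could be factored into a small helper. A sketch, assuming a scikit-learn scaler such as `MinMaxScaler` stands in for the notebook's `normalizer` (the helper name `normalize_column` is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def normalize_column(train_col, test_col, scaler=None):
    """Fit a scaler on the train column only, then transform both splits.

    Fitting on train alone avoids leaking test statistics into training.
    """
    scaler = scaler if scaler is not None else MinMaxScaler()
    train_col = np.asarray(train_col, dtype=float).reshape(-1, 1)
    test_col = np.asarray(test_col, dtype=float).reshape(-1, 1)
    scaler.fit(train_col)
    return scaler.transform(train_col), scaler.transform(test_col)

# toy usage: test values outside the train range can fall outside [0, 1]
tr, te = normalize_column([1, 5, 10], [2, 20])
print(tr.shape, te.shape)  # (3, 1) (2, 1)
```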
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
## BOW related feature along with other categorical and numerical features
X_tr_bow = hstack((x_train_essay_bow, x_train_title_bow, x_train_categories, x_train_subcategories, x_train_school_state,
x_train_teacher_prefix,x_train_project_grade_category,x_train_price_norm,x_train_no_of_projects_norm,
x_train_essay_sentiment_positive_norm,x_train_essay_sentiment_neutral_norm,x_train_essay_sentiment_compound_norm,
x_train_essay_sentiment_negative_norm)).tocsr()
X_te_bow = hstack((x_test_essay_bow, x_test_title_bow, x_test_categories, x_test_subcategories, x_test_school_state, x_test_teacher_prefix,
x_test_project_grade_category,x_test_price_norm,x_test_no_of_projects_norm,
x_test_essay_sentiment_positive_norm,x_test_essay_sentiment_neutral_norm,x_test_essay_sentiment_compound_norm,
x_test_essay_sentiment_negative_norm)).tocsr()
## Tfidf related feature along with other categorical and numerical features
X_tr_tfidf = hstack((x_train_essay_tfidf, x_train_title_tfidf, x_train_categories, x_train_subcategories, x_train_school_state,
x_train_teacher_prefix,x_train_project_grade_category,x_train_price_norm,x_train_no_of_projects_norm,
x_train_essay_sentiment_positive_norm,x_train_essay_sentiment_neutral_norm,x_train_essay_sentiment_compound_norm,
x_train_essay_sentiment_negative_norm)).tocsr()
X_te_tfidf = hstack((x_test_essay_tfidf, x_test_title_tfidf, x_test_categories, x_test_subcategories, x_test_school_state, x_test_teacher_prefix,
x_test_project_grade_category,x_test_price_norm,x_test_no_of_projects_norm,
x_test_essay_sentiment_positive_norm,x_test_essay_sentiment_neutral_norm,x_test_essay_sentiment_compound_norm,
x_test_essay_sentiment_negative_norm)).tocsr()
print("Final Data matrix Using BOW")
print(X_tr_bow.shape, y_train.shape)
print(X_te_bow.shape, y_test.shape)
print("="*100)
print("Final Data matrix Using TFIDF")
print(X_tr_tfidf.shape, y_train.shape)
print(X_te_tfidf.shape, y_test.shape)
print("="*100)
Final Data matrix Using BOW
(76473, 6009) (76473,)
(32775, 6009) (32775,)
====================================================================================================
Final Data matrix Using TFIDF
(76473, 6009) (76473,)
(32775, 6009) (32775,)
====================================================================================================
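`scipy.sparse.hstack` concatenates sparse blocks column-wise as long as their row counts match, which is why every block above has 76,473 (train) or 32,775 (test) rows. A minimal sketch with tiny made-up blocks:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# two sparse blocks with the same number of rows but different widths
text_block = csr_matrix(np.array([[0, 1, 0], [2, 0, 0]]))   # e.g. BOW features
numeric_block = csr_matrix(np.array([[0.5], [0.9]]))        # e.g. a scaled column

# hstack returns COO format; .tocsr() makes row slicing efficient for training
combined = hstack((text_block, numeric_block)).tocsr()
print(combined.shape)  # (2, 4)
```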
## HyperParameter tuning using GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import GridSearchCV
nb = MultinomialNB()
params = {'alpha':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 3, 5, 10, 30]}
clf = GridSearchCV(nb, params, scoring='roc_auc', return_train_score=True, cv=3, n_jobs=-1)
clf.fit(X_tr_bow,y_train)
GridSearchCV(cv=3, estimator=MultinomialNB(), n_jobs=-1,
param_grid={'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05,
0.1, 0.5, 1, 3, 5, 10, 30]},
return_train_score=True, scoring='roc_auc')
def fit_nb_and_plot_aucVsAlpha(clf):
    results = pd.DataFrame(clf.cv_results_)
    # sort_values returns a new frame; assign it back so the sort takes effect
    results = results.sort_values(by=['param_alpha'])
    train_auc = results['mean_train_score'].values
    cv_auc = results['mean_test_score'].values
    alpha = results['param_alpha'].values
    print('Best hyper parameter:', clf.best_params_)
    print('Best score:', clf.best_score_)
    plt.plot(alpha, cv_auc, label='CV AUC')
    plt.plot(alpha, train_auc, label='Train AUC')
    plt.scatter(alpha, train_auc, label='Train AUC points')
    plt.scatter(alpha, cv_auc, label='CV AUC points')
    plt.legend()
    plt.xlabel("Alpha: hyperparameter")
    plt.ylabel("AUC")
    plt.title("Hyperparameter vs AUC plot")
    plt.grid()
    plt.show()
fit_nb_and_plot_aucVsAlpha(clf)
Best hyper parameter: {'alpha': 0.5}
Best score: 0.6877829465017751
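The plotting helper above reads `mean_train_score` and `mean_test_score` out of `cv_results_`; a toy run on synthetic count data (illustrative only, not the DonorsChoose features) shows where those columns come from:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 8))   # small non-negative count matrix
y = np.arange(60) % 2                  # balanced toy labels

search = GridSearchCV(MultinomialNB(), {'alpha': [0.1, 1.0]},
                      scoring='roc_auc', cv=3, return_train_score=True)
search.fit(X, y)
# one row per alpha value; these are the columns the plotting helper consumes
print(search.cv_results_['mean_train_score'], search.cv_results_['mean_test_score'])
```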
# function to predict probability scores for the positive class
def pred_proba(clf, data):
    return clf.predict_proba(data)[:,1]
from sklearn.metrics import roc_curve,auc
nb_bow = MultinomialNB(alpha = clf.best_params_['alpha'])
nb_bow.fit(X_tr_bow,y_train)
MultinomialNB(alpha=0.5)
def predict_and_plot_roc(x_train, x_test, y_train, y_test, clf):
    # predicted probability scores for train and test
    y_train_pred = pred_proba(clf, x_train)
    y_test_pred = pred_proba(clf, x_test)
    # fpr, tpr, threshold values for train and test data
    train_fpr, train_tpr, tr_threshold = roc_curve(y_train, y_train_pred)
    test_fpr, test_tpr, te_threshold = roc_curve(y_test, y_test_pred)
    # plot the ROC curve for both train and test
    plt.plot(train_fpr, train_tpr, label="Train AUC = "+str(auc(train_fpr, train_tpr)))
    plt.plot(test_fpr, test_tpr, label="Test AUC = "+str(auc(test_fpr, test_tpr)))
    plt.legend()
    plt.xlabel("False Positive Rate (FPR)")
    plt.ylabel("True Positive Rate (TPR)")
    plt.title("ROC Curve")
    plt.grid()
    plt.show()
    return train_fpr, train_tpr, tr_threshold, test_fpr, test_tpr, te_threshold, y_train_pred, y_test_pred
train_fpr, train_tpr, tr_threshold, test_fpr, test_tpr, te_threshold, y_train_pred, y_test_pred = predict_and_plot_roc(X_tr_bow, X_te_bow, y_train, y_test, nb_bow)
# we write our own predict function with an explicit threshold:
# pick the threshold that maximizes tpr*(1-fpr), i.e. high tpr with low fpr
def find_best_threshold(threshold, fpr, tpr):
    t = threshold[np.argmax(tpr*(1-fpr))]
    # tpr*(1-fpr) is maximal when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    return t
def predict_with_best_t(proba, threshold):
    predictions = []
    for i in proba:
        if i >= threshold:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
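The thresholding loop above can also be written as a one-line NumPy comparison, which is faster on long probability arrays (the name `predict_with_threshold` is illustrative):

```python
import numpy as np

def predict_with_threshold(proba, threshold):
    # vectorized equivalent of the loop: 1 where proba >= threshold, else 0
    return (np.asarray(proba) >= threshold).astype(int)

print(predict_with_threshold([0.2, 0.5, 0.81], 0.5))  # [0 1 1]
```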
from sklearn.metrics import confusion_matrix
import numpy as np
print("="*100)
best_t_bow = find_best_threshold(tr_threshold, train_fpr, train_tpr)
print("Train confusion matrix")
train_confusion_matrix_bow = confusion_matrix(y_train,predict_with_best_t(y_train_pred,best_t_bow))
print(train_confusion_matrix_bow)
print("="*100)
print("Test confusion matrix")
test_confusion_matrix_bow = confusion_matrix(y_test,predict_with_best_t(y_test_pred,best_t_bow))
print(test_confusion_matrix_bow)
====================================================================================================
the maximum value of tpr*(1-fpr) 0.24999999976253975 for threshold 0.99
Train confusion matrix
[[ 9250  2329]
 [32448 32446]]
====================================================================================================
Test confusion matrix
[[ 3789  1174]
 [13927 13885]]
## https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
## https://blog.quantinsti.com/creating-heatmap-using-python-seaborn/
def plot_confusion_matrix(train_confusion_matrix, test_confusion_matrix):
    key = np.asarray([['TN', 'FP'], ['FN', 'TP']])
    fig, ax = plt.subplots(1, 2, figsize=(20, 6))
    labels_train = np.asarray(["{0} = {1}".format(k, v) for k, v in zip(key.flatten(), train_confusion_matrix.flatten())]).reshape(2, 2)
    labels_test = np.asarray(["{0} = {1}".format(k, v) for k, v in zip(key.flatten(), test_confusion_matrix.flatten())]).reshape(2, 2)
    sns.heatmap(train_confusion_matrix, linewidths=.5, xticklabels=['PREDICTED : NO', 'PREDICTED : YES'], yticklabels=['ACTUAL : NO', 'ACTUAL : YES'], annot=labels_train, fmt='', ax=ax[0])
    sns.heatmap(test_confusion_matrix, linewidths=.5, xticklabels=['PREDICTED : NO', 'PREDICTED : YES'], yticklabels=['ACTUAL : NO', 'ACTUAL : YES'], annot=labels_test, fmt='', ax=ax[1])
    ax[0].set_title('Train Set')
    ax[1].set_title('Test Set')
    plt.show()
plot_confusion_matrix(train_confusion_matrix_bow,test_confusion_matrix_bow)
nb = MultinomialNB()
params = {'alpha':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 3, 5, 10, 30]}
clf = GridSearchCV(nb, params, scoring='roc_auc', return_train_score=True, cv=3, n_jobs=-1)
clf.fit(X_tr_tfidf,y_train)
GridSearchCV(cv=3, estimator=MultinomialNB(), n_jobs=-1,
param_grid={'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05,
0.1, 0.5, 1, 3, 5, 10, 30]},
return_train_score=True, scoring='roc_auc')
fit_nb_and_plot_aucVsAlpha(clf)
Best hyper parameter: {'alpha': 0.5}
Best score: 0.6644858943742363
nb_tfidf = MultinomialNB(alpha = clf.best_params_['alpha'])
nb_tfidf.fit(X_tr_tfidf,y_train)
MultinomialNB(alpha=0.5)
train_fpr, train_tpr, tr_threshold, test_fpr, test_tpr, te_threshold, y_train_pred, y_test_pred = predict_and_plot_roc(X_tr_tfidf, X_te_tfidf, y_train, y_test, nb_tfidf)
from sklearn.metrics import confusion_matrix
import numpy as np
print("="*100)
best_t_tfidf = find_best_threshold(tr_threshold, train_fpr, train_tpr)
print("Train confusion matrix")
train_confusion_matrix_tfidf = confusion_matrix(y_train,predict_with_best_t(y_train_pred,best_t_tfidf))
print(train_confusion_matrix_tfidf)
print("="*100)
print("Test confusion matrix")
test_confusion_matrix_tfidf = confusion_matrix(y_test,predict_with_best_t(y_test_pred,best_t_tfidf))
print(test_confusion_matrix_tfidf)
====================================================================================================
the maximum value of tpr*(1-fpr) 0.2499999712673104 for threshold 0.856
Train confusion matrix
[[ 9005  2574]
 [32458 32436]]
====================================================================================================
Test confusion matrix
[[ 3658  1305]
 [14189 13623]]
plot_confusion_matrix(train_confusion_matrix_tfidf,test_confusion_matrix_tfidf)
feature_names_tfidf.extend(['price','teacher_number_of_previously_posted_projects','essay_sentiment_positive',
'essay_sentiment_neutral','essay_sentiment_compound','essay_sentiment_negative'])
feature_names_bow.extend(['price','teacher_number_of_previously_posted_projects','essay_sentiment_positive',
'essay_sentiment_neutral','essay_sentiment_compound','essay_sentiment_negative'])
## log probabilities of all features for class 1; we keep the top 500
log_prob_bow = nb_bow.feature_log_prob_[1].argsort()
log_prob_tfidf = nb_tfidf.feature_log_prob_[1].argsort()
top_500_features_index_bow = list(log_prob_bow[-500:])
top_500_features_index_tfidf = list(log_prob_tfidf[-500:])
top_500_features_bow = [feature_names_bow[index] for index in top_500_features_index_bow]
top_500_features_tfidf = [feature_names_tfidf[index] for index in top_500_features_index_tfidf]
# print('Top 500 features using TFIDF:', top_500_features_tfidf)
# print('='*100)
# print('Top 500 features using BOW:', top_500_features_bow)
## https://stackoverflow.com/questions/13843352/what-is-the-fastest-way-to-slice-a-scipy-sparse-matrix
# slice the sparse matrix first, then densify only the selected 500 columns
x_train_tfidf = X_tr_tfidf[:, top_500_features_index_tfidf].toarray()
x_test_tfidf = X_te_tfidf[:, top_500_features_index_tfidf].toarray()
x_train_tfidf.shape, x_test_tfidf.shape
((76473, 500), (32775, 500))
x_train_bow = X_tr_bow[:, top_500_features_index_bow].toarray()
x_test_bow = X_te_bow[:, top_500_features_index_bow].toarray()
x_train_bow.shape, x_test_bow.shape
((76473, 500), (32775, 500))
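The `feature_log_prob_[1].argsort()` trick ranks features from least to most indicative of class 1, so the last entries are the strongest. A toy sketch with made-up word counts (names are illustrative):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0, 1],
              [2, 0, 1],
              [0, 4, 0],
              [0, 3, 1]])
y = np.array([1, 1, 0, 0])
names = ['word_a', 'word_b', 'word_c']

model = MultinomialNB(alpha=1.0).fit(X, y)
# feature_log_prob_[1] holds log P(feature | class 1); argsort is ascending
order = model.feature_log_prob_[1].argsort()
top = [names[i] for i in order[-2:]]   # two most indicative features for class 1
print(top)  # ['word_c', 'word_a']
```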
nb = MultinomialNB()
params = {'alpha':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 3, 5, 10, 30]}
clf = GridSearchCV(nb, params, scoring='roc_auc', return_train_score=True, cv=3, n_jobs=-1)
clf.fit(x_train_bow,y_train)
GridSearchCV(cv=3, estimator=MultinomialNB(), n_jobs=-1,
param_grid={'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05,
0.1, 0.5, 1, 3, 5, 10, 30]},
return_train_score=True, scoring='roc_auc')
fit_nb_and_plot_aucVsAlpha(clf)
Best hyper parameter: {'alpha': 0.0001}
Best score: 0.6595295850828461
nb_bow_top_500 = MultinomialNB(alpha = clf.best_params_['alpha'])
nb_bow_top_500.fit(x_train_bow,y_train)
MultinomialNB(alpha=0.0001)
train_fpr, train_tpr, tr_threshold, test_fpr, test_tpr, te_threshold, y_train_pred, y_test_pred = predict_and_plot_roc(x_train_bow, x_test_bow, y_train, y_test, nb_bow_top_500)
nb = MultinomialNB()
params = {'alpha':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 3, 5, 10, 30]}
clf = GridSearchCV(nb, params, scoring='roc_auc', return_train_score=True, cv=3, n_jobs=-1)
clf.fit(x_train_tfidf,y_train)
GridSearchCV(cv=3, estimator=MultinomialNB(), n_jobs=-1,
param_grid={'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05,
0.1, 0.5, 1, 3, 5, 10, 30]},
return_train_score=True, scoring='roc_auc')
fit_nb_and_plot_aucVsAlpha(clf)
Best hyper parameter: {'alpha': 0.0001}
Best score: 0.6230643803095202
nb_tfidf_top_500 = MultinomialNB(alpha = clf.best_params_['alpha'])
nb_tfidf_top_500.fit(x_train_tfidf,y_train)
MultinomialNB(alpha=0.0001)
train_fpr, train_tpr, tr_threshold, test_fpr, test_tpr, te_threshold, y_train_pred, y_test_pred = predict_and_plot_roc(x_train_tfidf, x_test_tfidf, y_train, y_test, nb_tfidf_top_500)
# Please compare all your models using Prettytable library
# http://zetcode.com/python/prettytable/
from prettytable import PrettyTable
TB = PrettyTable()
TB.field_names = ["Model", "Hyperparameter", "Train_AUC", "Test_Auc"]
TB.title = "Naive Bayes"
TB.add_row(["BOW Model", "Alpha:0.5", 0.7132, 0.6976])
TB.add_row(["TFIDF Model", "Alpha:0.5", 0.6995, 0.6610])
TB.add_row(["BOW(Top 500 Features)", "Alpha:0.0001", 0.6657, 0.6607])
TB.add_row(["TfIdf(Top 500 Features)", "Alpha:0.0001", 0.6272, 0.6187])
print(TB)
+-----------------------------------------------------------------+
|                           Naive Bayes                           |
+-------------------------+----------------+-----------+----------+
|          Model          | Hyperparameter | Train_AUC | Test_Auc |
+-------------------------+----------------+-----------+----------+
|        BOW Model        |   Alpha:0.5    |   0.7132  |  0.6976  |
|       TFIDF Model       |   Alpha:0.5    |   0.6995  |  0.661   |
|  BOW(Top 500 Features)  |  Alpha:0.0001  |   0.6657  |  0.6607  |
| TfIdf(Top 500 Features) |  Alpha:0.0001  |   0.6272  |  0.6187  |
+-------------------------+----------------+-----------+----------+